Abstract

Within a mathematically rigorous model borrowed from statistical learning
theory, we analyse the curse of dimensionality for similarity-based
information retrieval in the context of popular indexing schemes: metric
trees. The datasets $X$ are sampled randomly from a domain $\Omega$,
equipped with a distance, $\rho$, and an underlying probability
distribution, $\mu$. While performing an asymptotic analysis, we send the
intrinsic dimension $d$ of $\Omega$ to infinity, and assume that the size
of a dataset, $n$, grows superpolynomially yet subexponentially in $d$.
Exact similarity search refers to finding the nearest neighbour in the
dataset $X$ to a query point $\omega \in \Omega$, where the query points
are subject to the same probability distribution $\mu$ as datapoints. Let
$\mathscr{F}$ denote a class of all 1-Lipschitz functions on $\Omega$ that
can be used as decision functions in constructing a hierarchical metric
tree indexing scheme. Suppose the VC dimension of all sets
$\{\omega : f(\omega) \ge a\}$, $a \in \mathbb{R}$, is $d^{O(1)}$. (In
view of a result of Goldberg and Jerrum, this is a reasonable complexity
assumption.) We deduce superpolynomial in $d$ lower bounds on the expected
average case performance of hierarchical metric-tree based indexing schemes
for exact similarity search in $(\Omega, X)$.



Introduction 

Every similarity query in a dataset with $n$ points can be answered in time
$O(n)$ through a simple linear scan of the dataset, and in practice such a
scan often outperforms the best known indexing schemes for high-dimensional
workloads. This is known as the curse of dimensionality, cf. e.g. Chapter
9 in [33], as well as [HUT]. Paradoxically, there is still no mathematical
proof that the above phenomenon is in the nature of high-dimensional
datasets. While the concept of intrinsic dimension of data is open to
discussion (see e.g. pO'), even in cases commonly accepted as
"high-dimensional" (e.g. uniformly distributed data in the Hamming cube
$\{0,1\}^d$ as $d \to \infty$), the "curse of dimensionality conjecture"
for proximity search remains unproven [15]. Diverse results in this
direction [51[51[^l51l34| are still preliminary.

Here we will verify the conjecture for a particular class of indexing
schemes widely used in similarity search and going back to [36]: metric
trees. This is the name given to hierarchical partitioning indexing schemes
equipped with 1-Lipschitz (non-expanding) decision functions at every node.

We assume that datapoints are drawn from the domain $\Omega$ with regard
to an underlying probability measure $\mu$, independently of each other.
The domain is a metric space, that is, the similarity measure, $\rho$,
satisfies the axioms of a metric. The intrinsic dimension of $\Omega$ is
defined in terms of concentration of measure as in [30]. This concept
agrees with the usual notion of dimension in cases such as the Hamming cube
$\{0,1\}^d$ or the Euclidean ball $B^d$, and is most relevant. A dataset
$X \subseteq \Omega$ with $n$ points is modelled by i.i.d. random variables
distributed according to $\mu$. We assume, as in [15], that the number of
datapoints $n$ grows superpolynomially in the dimension $d$, yet
subexponentially in $d$. Using the notation of asymptotic algorithm
analysis, this can be written as $n = d^{\omega(1)}$ and
$d = \omega(\log n)$.

It is clear that the computational complexity of decision functions used
in constructing a metric tree is a major factor in a scheme's performance.
We take this into account in the form of a combinatorial restriction on the
subclass $\mathscr{F}$ of all functions on $\Omega$ that are allowed to be
used as decision functions, by requiring a well-known parameter of
statistical learning theory, the Vapnik-Chervonenkis dimension of
$\mathscr{F}$ [37], to be polynomial in $d$, that is,
VC-dim$(\mathscr{F}) = d^{O(1)}$.

A very general class of functions satisfying this VC dimension bound is
provided by a theorem of Goldberg and Jerrum [12] about function classes
parametrized by elements of $\mathbb{R}^s$ whose computation involves
arithmetic operations, conditioning on inequalities, and outputs 0 or 1.
Apparently, the decision functions of all indexing schemes used in practice
so far in Euclidean (and Hamming cube) domains fall into this class.

Under the above assumptions, we prove a superpolynomial in $d$ lower bound
on the expected average performance of all possible metric trees.

The domains of interest are the Euclidean space $\mathbb{R}^d$ equipped
with the gaussian measure $\gamma^d$, the cube $[0,1]^d$ with the uniform
measure, the Euclidean sphere $S^d$ with the Haar (Lebesgue) measure, and
the Hamming cube $\{0,1\}^d$ with the Hamming distance and the counting
measure. In order to treat all of these simultaneously, we work in a
general setting of metric spaces with measure. This approach, in our view,
helps to stress the underlying geometry while ignoring unnecessary detail.
1 General framework for similarity search

We follow a formalism of [13] as adapted for similarity search p51l31|. A
workload is a triple $W = (\Omega, X, \mathcal{Q})$, where $\Omega$ is the
domain, whose elements can occur as datapoints and as query points,
$X \subseteq \Omega$ is a finite subset (dataset, or instance), and
$\mathcal{Q} \subseteq 2^\Omega$ is a family of queries. Answering a query
$Q \in \mathcal{Q}$ means listing all datapoints $x \in X \cap Q$.

A (dis)similarity measure on $\Omega$ is a function of two arguments
$\rho\colon \Omega \times \Omega \to \mathbb{R}$, which we assume to be a
metric, as in [43]. (Although sometimes one needs to consider more general
similarity measures, cf. jllllSlj.) A range similarity query centred at
$\omega \in \Omega$ is a ball of radius $\varepsilon$ around the query
point:

$Q = B_\varepsilon(\omega) = \{x \in \Omega : \rho(\omega, x) < \varepsilon\}$.

Equipped with such balls as queries, the triple $W = (\Omega, \rho, X)$
forms a range similarity workload.

Fig. 1 A range query.

The $k$-nearest neighbours ($k$-NN) query centred at $\omega \in \Omega$,
where $k \in \mathbb{N}$, is normally reduced to a range query of a
suitable search radius.

A workload is inner if $X = \Omega$ and outer if $|X| \ll |\Omega|$. There
is a substantial difference between the two types of workloads, and most
workloads of practical interest are outer workloads, that is, a typical
query point will come from outside the dataset, cf. [5T|.

2 Hierarchical tree index structures 

An access method is an algorithm that correctly answers every range query. 
Examples of access methods are given by indexing schemes. A hierarchical 
tree-based indexing scheme is a sequence of refining partitions of the domain 
labelled with a finite rooted tree. (For simplicity, we will assume all trees to 
be binary: this is not really restrictive.) Such a scheme takes storage
space $O(n)$.

To process a range query $B_\varepsilon(\omega)$, we traverse the tree
recursively to the leaf level. Once a leaf $B$ is reached, its contents
(datapoints $x \in X \cap B$) are accessed, and the condition
$x \in B_\varepsilon(\omega)$ is verified for each one of them.
Let $C \subseteq \Omega$ be an internal node having child nodes $A$ and
$B$, so that $C = A \cup B$. A branch descending from $B$ can be pruned
provided $B_\varepsilon(\omega) \cap B = \emptyset$, because then
datapoints contained in $B$ are of no further interest. This is the case
where one can certify that $\omega$ is not contained in the
$\varepsilon$-neighbourhood of $B$,

$\omega \notin B_\varepsilon = \{x \in \Omega : \rho(x, B) < \varepsilon\}$.

(Cf. Fig. 3, l.h.s.) Similarly, if $\omega \notin A_\varepsilon$, then the
sub-tree descending from $A$ can be pruned. However, if $\omega$ belongs to
the intersection of the $\varepsilon$-neighbourhoods of $A$ and $B$,
pruning is impossible and the search branches out. (Cf. Fig. 3, r.h.s.)




Fig. 3 Pruning is possible (l.h.s.), and impossible (r.h.s.). 



In order to certify that $B_\varepsilon(\omega) \cap B = \emptyset$, one
employs decision functions. A function $f\colon \Omega \to \mathbb{R}$ is
1-Lipschitz if

$\forall x, y \in \Omega, \quad |f(x) - f(y)| \le \rho(x, y)$.

Assign to every internal node $C$ a 1-Lipschitz function $f = f_C$ so that
$f_C|_B \le 0$ and $f_C|_A \ge 0$. It is easily seen that
$f_C|_{B_\varepsilon} < \varepsilon$, and so the fact that

$f_C(\omega) \ge \varepsilon$

serves as a certificate for

$B_\varepsilon(\omega) \cap B = \emptyset$,

assuring that a sub-tree descending from $B$ can be pruned. Similarly, if
$f_C(\omega) \le -\varepsilon$, the sub-tree descending from $A$ can be
pruned.
Fig. 4 Graph of a decision function $f = f_C$.

Note that decision functions should have sufficiently low computational 
complexity in order for the indexing scheme to be efficient. 

A hierarchical indexing structure employing 1-Lipschitz decision func- 
tions at every node is known as a metric tree. 

3 Metric trees 

Here is a formal definition. A metric tree for a metric similarity workload
$(\Omega, \rho, X)$ consists of

— a finite binary rooted tree $T$,

— a collection of (possibly partially defined) real-valued 1-Lipschitz
functions $f_t\colon B_t \to \mathbb{R}$ for every inner node $t$ (decision
functions), where $B_t \subseteq \Omega$,

— a collection of bins $B_t \subseteq \Omega$ for every leaf node $t$,
containing pointers to elements $X \cap B_t$,

so that

— $B_{\mathrm{root}(T)} = \Omega$,

— for every internal node $t$ and child nodes $t_-, t_+$, one has
$B_t \subseteq B_{t_-} \cup B_{t_+}$,

— $f_t|_{B_{t_-}} \le 0$, $f_t|_{B_{t_+}} \ge 0$.

When processing a range query $B_\varepsilon(\omega)$,

— $t_-$ is accessed $\iff f_t(\omega) < \varepsilon$, and

— $t_+$ is accessed $\iff f_t(\omega) > -\varepsilon$.

Here is the search algorithm in pseudocode.

Algorithm 1

on input $(\omega, \varepsilon)$ do
    set $A_0 = \{\mathrm{root}(T)\}$ and $A = \emptyset$
    for each $i = 0, 1, \ldots, \mathrm{depth}(T) - 1$ do
        if $A_i \neq \emptyset$
        then for each $t \in A_i$ do
            if $t$ is an internal node
            then do
                if $f_t(\omega) < \varepsilon$
                then $A_{i+1} \leftarrow A_{i+1} \cup \{t_-\}$
                if $f_t(\omega) > -\varepsilon$
                then $A_{i+1} \leftarrow A_{i+1} \cup \{t_+\}$
            else for each $x \in B_t$ do
                if $x \in B_\varepsilon(\omega)$
                then $A \leftarrow A \cup \{x\}$
    return $A$

□

Under our assumptions on the metric tree, Algorithm 1 correctly answers
every range similarity query for the workload $(\Omega, \rho, X)$ and thus
is an access method.
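The traversal in Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the construction analysed in the paper: the metric is taken to be Euclidean, the decision functions are of the hypothetical ball type $f_t(\omega) = \rho(p_t, \omega) - r_t$ with a pivot $p_t$ and a median radius $r_t$ (which is 1-Lipschitz by the triangle inequality), and the names `Node`, `build` and `range_query` are invented for this sketch.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]

def rho(x: Point, y: Point) -> float:
    """Euclidean distance, standing in for the metric rho (an assumption)."""
    return math.dist(x, y)

@dataclass
class Node:
    # Internal node: decision function f_t(w) = rho(pivot, w) - radius.
    pivot: Optional[Point] = None
    radius: float = 0.0
    minus: Optional["Node"] = None         # child t-, where f_t <= 0
    plus: Optional["Node"] = None          # child t+, where f_t >= 0
    bucket: Optional[List[Point]] = None   # set for leaf nodes only

def build(points: Sequence[Point], leaf_size: int = 2) -> Node:
    pts = list(points)
    if len(pts) <= leaf_size:
        return Node(bucket=pts)
    pivot = pts[0]
    dists = sorted(rho(pivot, p) for p in pts)
    radius = dists[len(dists) // 2]        # median distance to the pivot
    inner = [p for p in pts if rho(pivot, p) < radius]
    outer = [p for p in pts if rho(pivot, p) >= radius]
    if not inner or not outer:             # degenerate split: stop here
        return Node(bucket=pts)
    return Node(pivot, radius, build(inner, leaf_size), build(outer, leaf_size))

def range_query(root: Node, w: Point, eps: float) -> List[Point]:
    result: List[Point] = []
    active = [root]                        # A_0 = {root(T)}
    while active:                          # one pass per tree level
        next_level: List[Node] = []
        for t in active:
            if t.bucket is None:           # internal node
                f = rho(t.pivot, w) - t.radius
                if f < eps:                # t- cannot be pruned
                    next_level.append(t.minus)
                if f > -eps:               # t+ cannot be pruned
                    next_level.append(t.plus)
            else:                          # leaf: verify each datapoint
                result.extend(x for x in t.bucket if rho(w, x) < eps)
        active = next_level
    return result

grid = [(float(i), float(j)) for i in range(6) for j in range(6)]
tree = build(grid)
print(sorted(range_query(tree, (2.3, 3.1), 1.5)))
```

The two tests `f < eps` and `f > -eps` are exactly the access conditions for $t_-$ and $t_+$; when both hold, the search branches out, which is the situation the lower bounds below are concerned with.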

For more, see [21], while the survey [5] presents a different perspective.
Each of the books is an excellent reference to indexing structures in
metric spaces.



4 Curse of dimensionality 

In practice, a simple linear scan of the dataset, taking time $O(n)$, often
outperforms the best known indexing schemes for high-dimensional workloads,
though of course there are exceptions, cf. e.g. a relatively efficient
scheme developed in [3S] for searching large databases of short protein
fragments. Nevertheless, in recent years the research emphasis has shifted
towards approximate similarity search:

— given $\varepsilon > 0$ and $\omega \in \Omega$, return a point
$x \in X$ that is [with confidence $> 1 - \delta$] at a distance
$< (1 + \varepsilon)\, d_{NN}(\omega)$ from $\omega$.

This has led to many impressive achievements, particularly [18, 16], see
also the survey [T^ and Chapter 7 in [35]. At the same time, research in
exact similarity search, especially concerning deterministic algorithms,
has slowed down. At a theoretical level, the following unproved conjecture
poses a major challenge.

Conjecture 1 (The curse of dimensionality conjecture, cf. [15]) Let
$X \subseteq \{0,1\}^d$ be a dataset with $n$ points, where the Hamming
cube $\{0,1\}^d$ is equipped with the Hamming ($\ell^1$) distance:

$d(x, y) = \#\{i : x_i \neq y_i\}$.

Suppose $d = n^{o(1)}$, but $d = \omega(\log n)$. (That is, the number of
points in $X$ has intermediate growth with regard to the dimension $d$: it
is superpolynomial in $d$, yet subexponential.) Then any data structure for
exact nearest neighbour search in $X$, with $d^{O(1)}$ query time, must use
$n^{\omega(1)}$ space within the cell probe model of computation [22].
The best lower bound currently known is $\Omega(d / \log(sd/n))$, where $s$
is the number of cells used by the data structure [27]. In particular, this
implies the earlier bound $\Omega(d / \log n)$ for polynomial space data
structures [3], as well as the bound $\Omega(d / \log d)$ for near linear
space (namely $n \log^{O(1)} n$). See also lUISS].



5 Concentration of measure 

As in [10], we assume the existence of an unknown probability measure $\mu$
on $\Omega$, such that both datapoints $X$ and query points $\omega$ are
being sampled with regard to $\mu$.

On the one hand, this assumption is open to debate: for instance, in a
typical university library most books (75 % or more) are never borrowed
a single time, so it is reasonable to assume that the distribution of
queries in a large dataset will be skewed equally heavily away from the
data distribution. On the other hand, there is no obvious alternative way
of making an a priori assumption about the query distribution, and in some
situations the assumption does make sense, e.g. in the context of a large
biological database where a newly-discovered protein fragment has to be
matched against every previously known sequence.

The triple $(\Omega, \rho, \mu)$ is known in a mathematical context as a
metric space with measure. This concept opens the way to systematically
using the phenomenon of concentration of measure on high-dimensional
structures, also known as the "Geometric Law of Large Numbers." This
phenomenon arguably plays at least some role in explaining the nature of
the curse of dimensionality, and can be informally summarized as follows:

for a typical "high-dimensional" structure $\Omega$, if $A$ is a subset
containing at least half of all points, then the measure of the
$\varepsilon$-neighbourhood of $A$ is overwhelmingly close to 1 already
for small $\varepsilon > 0$.

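Here is a rigorous way of dealing with the phenomenon. Define the
concentration function $\alpha_\Omega$ of a metric space with measure
$\Omega$ by

$$\alpha_\Omega(\varepsilon) = \begin{cases} \frac{1}{2}, & \text{if } \varepsilon = 0, \\ 1 - \min\{\mu(A_\varepsilon) : A \subseteq \Omega,\ \mu(A) \ge \frac{1}{2}\}, & \text{if } \varepsilon > 0. \end{cases}$$

The value of $\alpha_\Omega(\varepsilon)$ gives an upper bound on the
measure of the complement to the $\varepsilon$-neighbourhood
$A_\varepsilon$ of every subset $A$ of measure $\ge 1/2$, cf. Fig. 5.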
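As a numerical illustration of this definition, the following sketch computes, for the Hamming cube $\{0,1\}^d$ with the normalized Hamming distance and the uniform measure, the measure of the complement of $A_\varepsilon$ for one particular set $A$ of measure at least 1/2, namely the strings of weight at most $d/2$. The choice of $A$, the restriction to even $d$, and the function name are assumptions of this sketch; the value obtained is only a lower estimate for $\alpha_\Omega(\varepsilon)$ computed from a single set. It is compared with the Hoeffding-type quantity $e^{-2\varepsilon^2 d}$, which bounds this binomial tail.

```python
import math

def half_cube_complement_measure(d: int, eps: float) -> float:
    """Measure of the complement of the eps-neighbourhood of
    A = {x in {0,1}^d : weight(x) <= d/2}, for even d, under the uniform
    measure and the normalized Hamming distance. A string of weight w lies
    at normalized distance max(0, w - d/2) / d from A."""
    assert d % 2 == 0, "the sketch assumes even d"
    # Smallest weight whose distance to A is at least eps; the small
    # tolerance guards against floating-point error in eps * d.
    threshold = d // 2 + math.ceil(eps * d - 1e-12)
    tail = sum(math.comb(d, w) for w in range(threshold, d + 1))
    return tail / 2 ** d

for d in (10, 50, 200):
    value = half_cube_complement_measure(d, 0.1)
    bound = math.exp(-2 * 0.1 ** 2 * d)
    print(f"d={d:4d}  complement measure={value:.6f}  exp(-2 eps^2 d)={bound:.6f}")
```

For a fixed $\varepsilon$ the computed measure shrinks quickly as $d$ grows, which is the behaviour the concentration function is designed to capture.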

For example, let $\Omega = \{0,1\}^d$ be the Hamming cube of dimension $d$
equipped with the normalized Hamming distance

$d(x, y) = \frac{1}{d}\, \#\{i : x_i \neq y_i\}$

and the uniform (normalized counting) measure

$\mu_\#(A) = \frac{|A|}{2^d}$.






V. Pestov 




Fig. 5 To the concept of concentration function $\alpha_\Omega(\varepsilon)$.

Fig. 6 Concentration function of a Hamming cube vs Chernoff-Okamoto bound.



Then the concentration function of $\Omega$ satisfies a gaussian upper
estimate (Chernoff-Okamoto bound), cf. Fig. 6:

$\alpha_{\{0,1\}^d}(\varepsilon) \le e^{-2\varepsilon^2 d}$.

Similar bounds hold for Euclidean spheres $S^d$, cubes $I^d$, and many
other structures of both continuous and discrete mathematics, equipped with
suitably normalized distances and canonical probability measures. The
concentration phenomenon can be expressed by saying that for "typical"
high-dimensional metric spaces with measure, $\Omega$, the concentration
function $\alpha_\Omega(\varepsilon)$ drops off sharply as
$\dim \Omega \to \infty$ pTlfTO].

6 Workload assumptions

Here are our standing assumptions for the rest of the article. Let
$(\Omega, \rho, \mu)$ be a domain equipped with a metric $\rho$ and a
probability measure $\mu$. We assume that the expected distance between two
points of $\Omega$ is normalized so as to become asymptotically constant:

$\mathbb{E}\,\rho(x, y) = \Theta(1)$.    (1)

We further assume that $\Omega$ has "concentration dimension $d$" in the
sense that the concentration function is gaussian with exponent
$\Theta(d)$:

$\alpha_\Omega(\varepsilon) = \exp(-\Theta(\varepsilon^2 d))$.    (2)

(This approach to intrinsic dimension is developed in [29, 30].)

A dataset $X \subseteq \Omega$ contains $n$ points, where $n$ and $d$ are
related as follows:

$n = d^{\omega(1)}$,    (3)

$d = \omega(\log n)$.    (4)
In other words, asymptotically $n$ grows faster than any polynomial
function $C d^k$, $C > 0$, $k \in \mathbb{N}$, but slower than any
exponential function $e^{cd}$, $c > 0$. (An example of such rate of growth
is $n = 2^{\sqrt{d}}$.) For the purposes of asymptotic analysis of search
algorithms such assumptions are natural [15].
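The intermediate growth regime can be checked on a worked example. The sketch below takes the illustrative choice $n = 2^{\sqrt{d}}$ (an assumption made here for concreteness) and prints the two ratios that characterize the regime: $\log n / \log d$, which should tend to infinity (superpolynomial growth), and $\log n / d$, which should tend to zero (subexponential growth). The function name is hypothetical.

```python
import math

def growth_ratios(d: float):
    """For the illustrative choice n = 2**sqrt(d), return the pair
    (log n / log d, log n / d)."""
    log_n = math.sqrt(d) * math.log(2.0)
    return log_n / math.log(d), log_n / d

for d in (1e2, 1e4, 1e6):
    poly_ratio, exp_ratio = growth_ratios(d)
    print(f"d={d:.0e}  log n / log d = {poly_ratio:7.2f}  log n / d = {exp_ratio:.6f}")
```

The first ratio increases without bound while the second decreases to zero, so this $n$ eventually exceeds every $C d^k$ and stays below every $e^{cd}$.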

Datapoints are modelled by a sequence of i.i.d. random variables
distributed according to the measure $\mu$:

$X_1, X_2, \ldots, X_n \sim \mu$.

The instances of datapoints will be denoted with corresponding lower case
letters $x_1, x_2, \ldots, x_n$.

Finally, the query centres $\omega \in \Omega$ have the same distribution
$\mu$:

$\omega \sim \mu$.

7 Query radius 

As a well-known consequence of concentration, in high-dimensional domains
the distance to the nearest neighbour is close to the average distance
between two points (cf. e.g. [1] for a particular case). We will give a
proof of this result in the most general situation. Denote by
$\varepsilon_{NN}(\omega)$ the distance from $\omega \in \Omega$ to the
nearest point in $X$. The function $\varepsilon_{NN}$ is 1-Lipschitz, and
so it concentrates near its median value. From here, one deduces:

Lemma 1 Under our assumptions on the domain $\Omega$ and a random sample
$X$, with confidence approaching 1 one has for all $\varepsilon > 0$

$\mu\{\omega : |\varepsilon_{NN}(\omega) - \mathbb{E}\rho(x, y)| > \varepsilon\} < \exp(-\Theta(\varepsilon^2 d))$.

□

Remark 1 The result should be understood in the asymptotic sense, as
follows. We deal with a family of domains $\Omega_d$, $d \in \mathbb{N}$,
and the sampling is performed in each of them in an independent fashion, so
that "confidence" refers to the probability that the infinite sample path
belonging to the infinite product

$\Omega_1^{n_1} \times \Omega_2^{n_2} \times \cdots \times \Omega_d^{n_d} \times \cdots$

satisfies the desired properties.

The key to the proof is the following technical lemma.

Lemma 2 (Gromov-Milman [13]) Let $(\Omega, \rho, \mu)$ denote a metric
space with measure and $\alpha$ its concentration function. If
$A \subseteq \Omega$ is such that $\mu(A) > \alpha(\gamma)$ for some
$\gamma > 0$, then $\mu(A_\gamma) > 1/2$. □
Proof (of Lemma 1) Denote for $\omega \in \Omega$ by $\rho_\omega$ the
distance function to $\omega$. Being 1-Lipschitz, $\rho_\omega$
concentrates near its median value, which we denote by $R(\omega)$. In its
turn, the function $R\colon \Omega \to \mathbb{R}$ is also 1-Lipschitz, and
so concentrates around its median value, $R_M$. The difference between the
mean and the median of every 1-Lipschitz function on $\Omega$ converges to
zero uniformly as $d \to \infty$, and for this reason we can substitute
$R_M$ for $\mathbb{E}_{\mu \otimes \mu}(\rho)$ in the rest of the proof:
indeed, $R_M \to \mathbb{E}_{\mu \otimes \mu}(\rho) = 1$ in the limit
$d \to \infty$.

Denote by $\varepsilon_M$ the median value of the nearest neighbour radius
$\varepsilon_{NN}$. Suppose $\liminf_{d \to \infty} \varepsilon_M < 1$. By
proceeding to a subsequence of domains, we can select a $\gamma > 0$ so
that $\varepsilon_M < R_M - \gamma$. If $d$ is sufficiently large, the
probability that the function $R(\omega)$ deviates from $R_M$ by more than
$\gamma/2$ is exponentially small. Since $n = |X|$ only grows
subexponentially in $d$, with confidence
$1 - \exp(-\Theta(\varepsilon^2 d))$ one has for every $x \in X$:

$R(x) - \varepsilon_M \ge \frac{\gamma}{2}$.

According to Lemma 2,

$\mu(B_{\varepsilon_M}(x)) \le \alpha(\gamma/2) = \exp(-\Theta(\varepsilon^2 d))$,

and so

$\mu(X_{\varepsilon_M}) \le n \exp(-\Theta(\varepsilon^2 d)) = \exp(-\Theta(\varepsilon^2 d))$,

a contradiction, since by the definition of $\varepsilon_M$ one must have
$\mu(X_{\varepsilon_M}) \ge 1/2$. We conclude:
$\liminf_{d \to \infty} \varepsilon_M \ge 1$.

To establish the converse inequality, it suffices to notice that a ball of
radius $R(\omega)$ has measure $\ge 1/2$, and so if
$\liminf_{d \to \infty} \varepsilon_M$ is strictly greater than 1, there
will be, with high confidence, points $x \in X$ for which
$\varepsilon_M > R(x)$, and due to concentration,
$\mu(X_{\varepsilon_M})$ will be close to one, meaning the complement to
the $R(x)$-ball is very small, in contradiction to the choice of $R(x)$ as
the median. □



8 A "naive" average $O(n)$ lower bound

As a first approximation to our analysis, we present a (flawed) heuristic
argument, allowing linear in $n$ asymptotic lower bounds on the search
performance of a metric tree.

First of all, what happens at an internal node $C$ when a metric tree is
being traversed? Let $\alpha_C$ denote the concentration function of $C$
equipped with the metric induced from $\Omega$ and a probability measure
$\mu_C$ which is the normalized restriction of the measure $\mu$ from
$\Omega$:

for $A \subseteq C$, $\mu_C(A) = \frac{\mu(A)}{\mu(C)}$.
Suppose for the moment that our tree is perfectly balanced, in the sense
that $\mu_C(A) = \mu_C(B) = 1/2$. Then the size of the
$\varepsilon$-neighbourhood of $A$ is at least $1 - \alpha_C(\varepsilon)$,
and the same is true of the $\varepsilon$-neighbourhood of $B$. One
concludes: for all query points $\omega \in C$ except a set of measure
$\le 2\alpha_C(\varepsilon)$, the search Algorithm 1 branches out at the
node $C$. (Cf. Fig. 7.)




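Lemma 3 Let $C$ be a subset of a metric space with measure
$(\Omega, \rho, \mu)$. Denote by $\alpha_C$ the concentration function of
$C$ with regard to the induced metric $\rho|_C$ and the induced probability
measure $\mu/\mu(C)$. Then for all $\varepsilon > 0$

$\alpha_C(\varepsilon) \le \frac{\alpha_\Omega(\varepsilon/2)}{\mu(C)}$.

Proof Let $\varepsilon > 0$ be arbitrary, and let
$\delta < \alpha_C(\varepsilon)$. Then there are subsets
$D, E \subseteq C$ at a distance $\ge \varepsilon$ from each other,
satisfying $\mu(D) \ge \mu(C)/2$ and $\mu(E) \ge \delta\mu(C)$, in
particular the measure of either set is at least $\delta\mu(C)$. Since the
$\varepsilon/2$-neighbourhoods of $D$ and $E$ in $\Omega$ cannot meet by
the triangle inequality, the complement, $F$, to at least one of them,
taken in $\Omega$, has the property $\mu(F) \ge 1/2$, while
$\mu(F_{\varepsilon/2}) \le 1 - \delta\mu(C)$, because $F_{\varepsilon/2}$
does not meet one of the two original sets, $D$ or $E$. We conclude:
$\alpha_\Omega(\varepsilon/2) \ge \delta\mu(C)$, and taking suprema over
all $\delta < \alpha_C(\varepsilon)$,

$\alpha_\Omega(\varepsilon/2) \ge \alpha_C(\varepsilon)\mu(C)$,

that is, $\alpha_C(\varepsilon) \le \alpha_\Omega(\varepsilon/2)/\mu(C)$,
as required. □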

Since the size of the indexing scheme is $O(n)$, a typical size of a set
$C$ will be on the order $\Omega(n^{-1})$, while
$\alpha_\Omega(\varepsilon)$ will go to zero as $o(n^{-1})$.

Let a workload $(\Omega, \rho, X)$ be indexed with a balanced metric tree
of depth $O(\log n)$, having $O(n)$ bins of roughly equal size in the sense
of the probability measure $\mu$ underlying the datapoint distribution.
For at least half of all query points, the distance $\varepsilon_{NN}$ to
the nearest neighbour in $X$ is at least as large as $\varepsilon_M$, the
median NN distance. Let $\omega$ be such a query centre. For every element
$C$ of the level $t$ partition of $\Omega$, one has, using Lemmas 3 and 1
and the assumption in Eq. (2),

$\alpha_C(\varepsilon_M) \le \frac{\alpha_\Omega(\varepsilon_M/2)}{\mu(C)} = \Theta(2^t)\, e^{-\Theta(1)\varepsilon_M^2 d} = e^{-\Theta(d)}$,

where the constants do not depend on a particular internal node $C$. An
argument in Section 7 implies that branching at every internal node occurs
for all $\omega$ except a set of measure

$\le \#(\text{nodes}) \times 2 \sup_C \alpha_C(\varepsilon) = O(n)\, e^{-\Theta(d)} = o(1)$,

because $d = \omega(\log n)$ and so $e^{\Theta(d)}$ is superpolynomial in
$n$. Thus, the expected average performance of an indexing scheme as above
is linear in $n$.

There are two problems with this argument. Firstly, it has been observed
and verified experimentally that unbalanced metric tree indexing schemes
are more efficient than the balanced ones [7, 24]. Secondly and more
importantly, in our argument we have replaced the value of the empirical
measure,

$\mu_n(C) = \frac{|X \cap C|}{n}$,

with the value of the underlying measure $\mu(C)$, implicitly assuming that
the two are close to each other:

$\mu_n(C) \approx \mu(C)$.

But the scheme is being chosen after seeing an instance $X$, and it is
reasonable to assume that the choice of indexing partitions will take
advantage of large random clusters always present in uniformly distributed
data. (Fig. 8 illustrates this point in dimension $d = 2$.) Thus, some
elements of indexing partitions, while having large measure $\mu$, may
contain few datapoints, and vice versa.

In order to be able to estimate the empirical measure in terms of the
underlying distribution, one needs to invoke an approach of statistical
learning.
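The gap between the empirical and the underlying measure is easy to observe in a small simulation. The sketch below draws 1000 uniform points in the unit square, splits the square into a 10 × 10 grid of cells of equal measure 1/100, and reports the smallest and largest cell counts. The grid, the seed and the function name are assumptions of this illustration; the point is only that cells of equal measure $\mu$ carry visibly different numbers of datapoints.

```python
import random

def cell_counts(n: int, k: int, seed: int = 0):
    """Count uniform random points of [0,1]^2 falling in each cell of a
    k x k grid; every cell has underlying measure 1 / k**2."""
    rng = random.Random(seed)
    counts = [[0] * k for _ in range(k)]
    for _ in range(n):
        x, y = rng.random(), rng.random()
        counts[min(int(x * k), k - 1)][min(int(y * k), k - 1)] += 1
    return counts

counts = cell_counts(n=1000, k=10)
flat = [c for row in counts for c in row]
print("expected points per cell:", 1000 / 100)
print("observed min and max    :", min(flat), max(flat))
```

An indexing scheme built after seeing the sample is free to place its partition boundaries around the crowded and the sparse cells, which is why a uniform estimate over a whole class of sets is needed.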



9 Vapnik Chervonenkis theory 

Let $\mathscr{A}$ be a family of subsets of a set $\Omega$ (a concept
class). One says that a subset $B \subseteq \Omega$ is shattered by
$\mathscr{A}$ (cf. Fig. 9) if for each $C \subseteq B$ there is
$A \in \mathscr{A}$ such that

$A \cap B = C$.

The Vapnik-Chervonenkis dimension VC-dim$(\mathscr{A})$ of a class
$\mathscr{A}$ is the largest cardinality of a set $B \subseteq \Omega$
shattered by $\mathscr{A}$.

Estimating the VC dimension is often non-trivial, and here are some
examples.
Fig. 8 1000 points randomly and uniformly distributed in the square $[0,1]^2$.

Fig. 9 A set $B$ is shattered by the class $\mathscr{A}$.

1. The VC dimension of the class of all Euclidean balls in $\mathbb{R}^d$
is $d + 1$.

2. The class of all parallelepipeds in $\mathbb{R}^d$ has VC dimension
$2d + 2$.

3. The VC dimension of the class of all $\ell^1$-balls in the Hamming cube
$\{0,1\}^d$ is bounded from above by $d + \lfloor \log_2 d \rfloor$.

(As every ball is determined by its centre and radius, the total number
of pairwise different balls in $\{0,1\}^d$ is $d\,2^d$. Now one uses an
obvious observation: the VC dimension of a finite concept class
$\mathscr{A}$ is bounded above by $\log_2 |\mathscr{A}|$.)
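For small finite domains the definition of shattering can be checked by brute force, which makes the observation about finite classes concrete. The sketch below is illustrative only: the domain $\{0, \ldots, 5\}$ and the two concept classes (integer intervals and one-sided thresholds) are toy choices made here, and the function names are hypothetical.

```python
from itertools import combinations
from math import log2

def is_shattered(B, concept_class):
    """True if every subset of B arises as A ∩ B for some A in the class."""
    B = frozenset(B)
    traces = {frozenset(A) & B for A in concept_class}
    return len(traces) == 2 ** len(B)

def vc_dim(domain, concept_class):
    """Largest size of a shattered subset of a finite domain (brute force)."""
    best = 0
    for k in range(1, len(domain) + 1):
        if any(is_shattered(B, concept_class) for B in combinations(domain, k)):
            best = k
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return best

domain = range(6)
intervals = [set(range(a, b)) for a in range(7) for b in range(a, 7)]
thresholds = [set(range(a, 6)) for a in range(7)]

print("VC-dim(intervals)  =", vc_dim(domain, intervals))
print("VC-dim(thresholds) =", vc_dim(domain, thresholds))
print("log2 |intervals|   =", log2(len({frozenset(A) for A in intervals})))
```

A class with $N$ distinct members can realize at most $N$ traces on any set, while shattering a set of size $k$ requires $2^k$ of them; this is the counting argument behind the $\log_2$ bound used in example 3.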

Here is a deeper and very general observation.

Theorem 2 (Goldberg and Jerrum [12]) Consider the parametrized class

$\mathscr{A} = \{x \mapsto f(\theta, x) : \theta \in \mathbb{R}^s\}$

for some $\{0,1\}$-valued function $f$. Suppose that, for each input
$x \in \mathbb{R}^n$, there is an algorithm that computes $f(\theta, x)$,
and this computation takes no more than $t$ operations of the following
types:

— the arithmetic operations $+$, $-$, $\times$ and $/$ on real numbers,

— jumps conditioned on $>$, $\ge$, $<$, $\le$, $=$, and $\neq$
comparisons of real numbers, and

— output 0 or 1.

Then VC-dim$(\mathscr{A}) \le 4s(t + 2)$. □
Now, a typical result of statistical learning theory, where we skip over
the measurability assumptions (see [2, 37, 39, 20] for details and more).

Theorem 3 Let $\mathscr{A} \subseteq 2^\Omega$ be a concept class of finite
VC dimension, $d$. Then for all $\varepsilon, \delta > 0$ and every
probability measure $\mu$ on $\Omega$, if $n$ datapoints in $X$ are drawn
randomly and independently according to $\mu$, then with confidence
$1 - \delta$

$\sup_{A \in \mathscr{A}} \left| \mu(A) - \frac{|X \cap A|}{n} \right| < \varepsilon$,

provided $n$ is large enough:

$n \ge \frac{128}{\varepsilon^2} \left( d \log\left( \frac{2e^2}{\varepsilon} \log \frac{2e}{\varepsilon} \right) + \log \frac{8}{\delta} \right)$.

Let $\mathscr{F}$ be a class of (possibly partially defined) real-valued
functions on $\Omega$. Define $\mathscr{F}_\ge$ as the family of all sets
of the form

$\{\omega \in \mathrm{dom}\, f : f(\omega) \ge a\}$, $f \in \mathscr{F}$, $a \in \mathbb{R}$.

The value of VC-dim$(\mathscr{F}_\ge)$ is bounded above by the Pollard
dimension (pseudodimension) of $\mathscr{F}$ (cf. [39], 4.1.2), but is in
general smaller.

Example 1 (Pivots) If $\mathscr{F}$ is the class of all distance functions
to points of $\mathbb{R}^d$, then VC-dim$(\mathscr{F}_\ge) = d + 1$. (The
family $\mathscr{F}_\ge$ consists of complements to open balls, and the VC
dimension is invariant under proceeding to the complements.) For the
Hamming cube, one has
VC-dim$(\mathscr{F}_\ge) \le d + \lfloor \log_2 d \rfloor$, but we do not
know an exact bound.

Example 2 (vp-tree) The vp-tree [42] uses decision functions of the form

$f_t(\omega) = \frac{1}{2}\left( \rho(x_{t_+}, \omega) - \rho(x_{t_-}, \omega) \right)$,

where $t_\pm$ are the two children of $t$ and $x_{t_\pm}$ are the vantage
points for the node $t$.

If $\Omega = \mathbb{R}^d$, then $\mathscr{F}_\ge$ consists of all
half-spaces, and the VC dimension of this family is well known to equal
$d + 1$ (in consequence of the classical Radon theorem in convex geometry,
just like the proof for Euclidean spheres).

Example 3 (M-tree) The M-tree IT employs decision functions

$f_t(\omega) = \rho(x_t, \omega) - \sup_{\tau \in B_t} \rho(x_t, \tau)$,

where $B_t$ is a block corresponding to the node $t$, $x_t$ is a datapoint
chosen for each node $t$, and the suprema on the r.h.s. are precomputed and
stored. Here the VC dimension estimates are as in Example 1.
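The decision functions of Examples 2 and 3 can be written down directly, and their 1-Lipschitz property checked numerically on a sample. The sketch below assumes the Euclidean metric on $\mathbb{R}^5$ and random sample points; the function names are hypothetical, and a finite check of this kind illustrates the property without proving it.

```python
import math
import random

def rho(x, y):
    """Euclidean distance, standing in for the metric rho (an assumption)."""
    return math.dist(x, y)

def vp_decision(x_plus, x_minus):
    """f(w) = (rho(x_plus, w) - rho(x_minus, w)) / 2, as in Example 2."""
    return lambda w: 0.5 * (rho(x_plus, w) - rho(x_minus, w))

def ball_decision(x_t, block):
    """f(w) = rho(x_t, w) - sup over tau in block of rho(x_t, tau),
    as in Example 3; the supremum is computed once and stored."""
    covering_radius = max(rho(x_t, tau) for tau in block)
    return lambda w: rho(x_t, w) - covering_radius

def max_lipschitz_ratio(f, points):
    """Largest observed |f(x) - f(y)| / rho(x, y) over pairs of sample points."""
    return max(abs(f(x) - f(y)) / rho(x, y)
               for i, x in enumerate(points)
               for y in points[i + 1:] if rho(x, y) > 0)

rng = random.Random(0)
sample = [tuple(rng.uniform(-1, 1) for _ in range(5)) for _ in range(30)]
f_vp = vp_decision(sample[0], sample[1])
f_ball = ball_decision(sample[2], sample[3:10])
print("vp-type  ratio:", max_lipschitz_ratio(f_vp, sample))
print("ball-type ratio:", max_lipschitz_ratio(f_ball, sample))
```

Both ratios stay at or below 1, as the triangle inequality predicts: a distance function is 1-Lipschitz, subtracting a constant does not change that, and half the difference of two 1-Lipschitz functions is again 1-Lipschitz.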
10 Rigorous lower bounds

In this Section we prove the following theorem under the general
assumptions of Section 6.

Theorem 4 Let the domain $\Omega$ equipped with a metric $\rho$ and
probability measure $\mu$ have concentration dimension $\Theta(d)$ (cf.
Eq. (2)) and expected distance between two points
$\mathbb{E}\,\rho(x, y) = 1$. Let $\mathscr{F}$ be a class of all
1-Lipschitz functions on the domain $\Omega$ that can be used as decision
functions for metric tree indexing schemes of a given type. Suppose
VC-dim$(\mathscr{F}_\ge)$ is polynomial in $d$. Let $X$ be an i.i.d. random
sample of $\Omega$ according to $\mu$, having $n$ points, where
$d = n^{o(1)}$ and $d = \omega(\log n)$. Then, with confidence
asymptotically approaching 1, an optimal metric tree indexing scheme for
the similarity workload $(\Omega, \rho, X)$ has expected average
performance $d^{\omega(1)}$. In other words, the average search time for an
exact nearest neighbour is superpolynomial in the dimension $d$.

The following is an immediate consequence of Lemma 4.2 in [28].

Lemma 4 ("Bin Access Lemma") Let $\varepsilon > 0$ and $m \ge 4$ be such
that $\alpha_\Omega(\varepsilon) \le m^{-1}$, and let $\gamma$ be a
collection of subsets $A \subseteq \Omega$ of measure $\mu(A) \le m^{-1}$
each, satisfying $\mu(\cup\gamma) \ge 1/2$. Then the
$2\varepsilon$-neighbourhood of every point $\omega \in \Omega$, apart from
a set of measure at most $\frac{1}{2} m^{-1/2}$, meets at least
$\frac{1}{2} m^{1/2}$ elements of $\gamma$.

Here is the next step in the proof.

Lemma 5 Denote by $\mathscr{B}$ the class of all subsets
$B \subseteq \Omega$ appearing as bins of metric trees of depth $\le h$
built using certification functions from a class $\mathscr{F}$ satisfying
VC-dim$(\mathscr{F}_\ge) \le p$. Then

VC-dim$(\mathscr{B}) \le 4hp \log(2hp)$.

Proof Every $B \in \mathscr{B}$ is an intersection of a family of
$\le 2h$ sets from $\mathscr{F}_\ge$ or their complements. Now one uses
Th. 4.5 in [S^: if $\mathscr{A}$ is a concept class of VC dimension
$\le p$, then the VC dimension of the class of all sets obtained as
intersections of $\le h$ sets from $\mathscr{A}$ is bounded by
$2hp \log(hp)$. □

Let us prove Theorem 4. Without loss of generality, suppose that for any
fixed value $0 < c < 1$, such as e.g. $c = 1/4$, for all points $\omega$
except in a set of measure $< c$ the depth of the search tree is
polynomial in $d$, uniformly in $\omega$, for otherwise there is nothing to
prove.

Using Eq. (1) and Lemma 1, pick any $\varepsilon' > 0$ such that, for
sufficiently high values of $d$, for most points $\omega$ the value of
$\varepsilon_{NN}(\omega)$ exceeds $\varepsilon'$. Let $0 < \beta < 1/2$.
Again without losing generality, we can assume that the measure of the set
of query centres $\omega$ whose $\varepsilon'$-neighbourhood meets at least
one bin with $\ge n^{1/2+\beta}$ points is $< 1/4$.

Combining the two assumptions together, we deduce that for at least half of
all query centres $\omega$ the $\varepsilon'$-ball around $\omega$ only
meets bins with fewer than $n^{1/2+\beta}$ points. By Theorem 3 and Lemma
5, the value of measure for each of these bins is $< 2n^{\beta - 1/2}$ if
$n$ is sufficiently large. Lemma 4 applied with
$m = \frac{1}{2} n^{1/2-\beta}$ and $\varepsilon = \varepsilon'/2$ implies
that for all $\omega$ from a set of measure $1 - o(1)$ the
$\varepsilon'$-neighbourhood of $\omega$ meets at least
$\Omega(n^{1/4-\beta/2}) = d^{\omega(1)}$ bins. Since accessing each bin
requires at least one operation (if only to check that a bin is empty), the
theorem is proved. □

Combining our Theorem 4 with Theorem 2 of Goldberg and Jerrum shows that
for all practical purposes the worst-case average performance of metric
trees is superpolynomial in the dimension of the domain.

Theorem 5 Let the domain $\Omega = \mathbb{R}^d$ be equipped with a
probability measure $\mu_d$ in such a way that $(\mathbb{R}^d, \mu_d)$ form
a normal Lévy family and the $\mu_d$-expected value of the Euclidean
distance is $\Theta(1)$. Let $\mathscr{F}_d$ denote a class of functions
$f(\theta, x)$ on $\mathbb{R}^d$ parametrized with $\theta$ taking values
in a space $\mathbb{R}^{\mathrm{poly}(d)}$ and such that computing each
value $f(\theta, x)$ takes $d^{O(1)}$ operations of the type described in
Thm. 2. Let $X$ be an i.i.d. random sample of $\mathbb{R}^d$ according to
$\mu_d$, having $n$ points, where $d = n^{o(1)}$ and $d = \omega(\log n)$.
Then, with confidence asymptotically approaching 1, an optimal metric tree
indexing scheme for the similarity workload $(\Omega, \rho, X)$ whose
decision functions belong to the parametrized class $\mathscr{F}$ has
expected average performance $d^{\omega(1)}$. □

Three remarks are in order to explain the strength of the above result.

(1) Measures $\mu_d$ satisfying the above assumption include, for instance,
the gaussian distribution, the uniform measure on the unit ball, on the
unit sphere, on the unit cube, etc.

(2) A polynomial upper bound on the size of the parameter $\theta$ for
$\mathscr{F}$ is dictated by the obvious restriction that reading off a
parameter of superpolynomial length leads to a superpolynomial lower bound
on the length of computation.

(3) In the situations of interest, one can verify that the expected number
of datapoints $x \in X$ contained in the smallest query ball meeting $X$ is
$O(1)$. For continuous measures on $\mathbb{R}^d$ such as the gaussian
measure or the uniform measure on the cube etc., this will obviously be 1.
For the Hamming cube, the upper limit of this number as $d \to \infty$ is
bounded by $e \approx 2.7182\ldots$. Thus, the lower bound does not come
from the fact that there are simply too many valid near neighbours.

Discussion: limitations of the method 

The approach to obtaining lower bounds on performance of indexing schemes 
adopted in this paper consists in combining simple concentration of mea- 
sure considerations with the basic techniques of statistical learning (VC 
theory). The argument is applicable to the situation of the following kind. 
Let $W = (\Omega, \rho, X)$ denote a similarity workload. An indexing scheme for
$W$ consists of a family of real-valued 1-Lipschitz functions $f_i$, $i \in I$, on $\Omega$,
which are in general partially defined: $\mathrm{dom}(f_i) \subseteq \Omega$. Given a query $(\omega, \varepsilon)$,
where $\omega \in \Omega$ and $\varepsilon > 0$, the algorithm chooses recursively a sequence of
indices $i_n$, based on the previous values $f_{i_k}(\omega)$, $k < n$. At some point, the
computation is terminated, and the values $f_{i_k}(\omega)$ point at a collection of
bins, whose contents are read off. The role of the functions $f_i$ is to discard
those datapoints (or the entire bins) which cannot possibly answer the
query. Namely, if $|f_i(\omega) - f_i(x)| > \varepsilon$, then, since $f_i$ is a 1-Lipschitz function,
one has $\rho(\omega, x) > \varepsilon$, and so the point $x$ is irrelevant. All the points
(or entire bins) which cannot be discarded are returned and their contents
checked against the condition $\rho(x, \omega) < \varepsilon$.
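
The discarding rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scheme analysed in the paper: the single decision function is taken to be the distance to a fixed datapoint (which is 1-Lipschitz by the triangle inequality), the metric is Euclidean, and the names `range_query`, `pivot` and `euclidean` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(a, b)))


def range_query(data, pivot, query, eps, dist=euclidean):
    """Answer a range query (query, eps) using one 1-Lipschitz function.

    Assumption: the decision function is f(x) = dist(pivot, x).
    Returns (answers, checked), where `checked` counts the points that
    could not be discarded and had their true distance computed.
    """
    def f(x):
        # In a real index f(x) would be precomputed at indexing time.
        return dist(pivot, x)

    fq = f(query)
    answers, checked = [], 0
    for x in data:
        if abs(fq - f(x)) > eps:
            # Since f is 1-Lipschitz, dist(query, x) > eps: discard x.
            continue
        checked += 1
        if dist(x, query) < eps:
            answers.append(x)
    return answers, checked
```

For the four points (0, 0), (1, 0), (5, 0), (0, 5) with pivot (0, 0), query (1, 0.5) and eps = 1, only the point (1, 0) survives the test, and it is also the only answer.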

On spaces of high dimension, every 1-Lipschitz function concentrates
sharply near its mean (or median) value. If in addition we assume that the
class $\mathcal{F}$ of all functions used for a particular indexing scheme has a low
complexity in the sense of VC dimension, we can conclude that the number
of points discarded by every function $f_i$ drops off fast as dimension $d$ of the
domain grows, resulting in degrading performance. 
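
The concentration effect can be observed numerically. The sketch below is an illustration only, under assumptions that are not part of the argument above: the data are Gaussian with coordinates of variance 1/d (so that typical norms stay of order 1), the 1-Lipschitz function is the Euclidean norm, the query value is replaced by the sample median, and the radius `eps` is held fixed. The function names are hypothetical.

```python
import math
import random
import statistics


def sample_values(d, n, seed):
    """Values f(x) = |x| for n Gaussian points in dimension d.

    Assumption: coordinates have variance 1/d, a normalisation chosen
    for this sketch so that the norm is close to 1 on average.
    """
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(d)
    values = []
    for _ in range(n):
        x = [rng.gauss(0.0, sigma) for _ in range(d)]
        values.append(math.sqrt(sum(t * t for t in x)))
    return values


def lipschitz_spread(d, n=500, seed=0):
    """Standard deviation of the 1-Lipschitz function over the sample."""
    return statistics.pstdev(sample_values(d, n, seed))


def discarded_fraction(d, eps, n=500, seed=0):
    """Fraction of points with |f(x) - median| > eps, that is, points a
    typical query would be able to discard using f alone."""
    values = sample_values(d, n, seed)
    med = statistics.median(values)
    return sum(1 for v in values if abs(v - med) > eps) / n
```

Comparing, say, `lipschitz_spread(4)` with `lipschitz_spread(400)`, or `discarded_fraction(4, 0.2)` with `discarded_fraction(400, 0.2)`, shows the spread of values and the share of discarded points both shrinking as the dimension grows.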

So far, we are aware of essentially two different types of such indexing 
schemes: metric trees (treated in the present paper) and pivot tables [5]. 
For pivots, the methods of the present paper have been subsequently used 
to derive an expected average performance lower bound $O(n/d)$ [31]. It is
not clear to the author how to state a more general result from which both 
estimates would follow, nor whether such a result would be useful in view 
of lack of other examples. 

Even if the cell-probe model has some formal similarities with the metric 
tree scheme (a hierarchical tree structure, a collection of cells as an indexing 
scheme, computations performed at each node with a limited number of 
cells accessed, etc.), it is not clear whether the partially defined functions 
determined by the algorithm at each node will be 1-Lipschitz (they are
taking values in the Hamming cube). The examples of implemented indexing
schemes for exact nearest neighbour search known to this author seem to
be using 1-Lipschitz functions, but of course this does not preclude the
existence of schemes based on other ideas. 

Furthermore, assuming that an indexing scheme consists of a family of
1-Lipschitz functions whose values are recursively computed by the algorithm
does not necessarily imply that the role of the functions is reduced to
certifying that a certain point is not in the $\varepsilon$-ball around the query point. As
an example, consider the indexing scheme based on a walk on the Delaunay
graph of $X$ in $\Omega$ and called spatial approximation by its author [23]. For
every datapoint $x \in X$, the scheme stores a list of datapoints whose Voronoi
cells are adjacent to the cell containing $x$. At the search phase, a sequence
of datapoints $x_1, x_2, \ldots, x_n$ is chosen, where each $x_{i+1}$ is the closest point to
$\omega$ on the list of points Delaunay-adjacent to $x_i$. If choosing $x_{i+1}$ so as to get
closer to $\omega$ is impossible, one backtracks. In practice, the scheme performs
on par with the state of the art pivot or metric tree based schemes [5S]. We
do not know whether our methods can be employed to prove the curse of
dimensionality for this particular scheme in the same general setting.
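
The search phase of such a walk can be sketched as follows. This is a simplified illustration, not the spatial approximation scheme itself: the adjacency lists are assumed to be given (here as a hypothetical dictionary `neighbours` of index lists), and the backtracking step is omitted, so the walk simply stops at the first point none of whose listed neighbours is closer to the query.

```python
def greedy_walk(points, neighbours, start, query, dist):
    """Walk along adjacency lists towards the query.

    points     : list of datapoints
    neighbours : dict mapping an index to the indices adjacent to it
                 (assumed precomputed, e.g. Delaunay adjacency)
    start      : index of the datapoint where the walk begins
    Returns (index of the final point, list of visited indices).
    Backtracking is deliberately left out of this sketch.
    """
    current = start
    path = [current]
    while True:
        candidates = neighbours.get(current, [])
        if not candidates:
            return current, path
        best = min(candidates, key=lambda j: dist(points[j], query))
        if dist(points[best], query) >= dist(points[current], query):
            # No adjacent point is closer to the query: stop here.
            return current, path
        current = best
        path.append(current)
```

On the real line, where consecutive points are adjacent, the walk over the points 0, 1, 3, 7, 12 started at 0 with query 6.5 visits 0, 1, 3 and stops at 7, the nearest neighbour.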

It appears that attempting to extend the method to randomized, approximate
NN search stands no chance either. Firstly, the dimensionality
reduction-type methods often present in randomized algorithms for approximate
search [18, 16, 1] mean that instead of 1-Lipschitz functions, one is
using what may be called "probably approximately 1-Lipschitz" ones. For
instance, a random projection from a high-dimensional Euclidean space to
a subspace of smaller dimension, appropriately rescaled, will have the property
that for most pairs of points $x, y$ the distance between them is approximately
preserved, to within a factor of $1 \pm \varepsilon$. This property in itself is a
consequence of concentration of measure, but such maps do not exhibit a
strong concentration property, rendering our methods inapplicable.
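
The approximate preservation of distances is easy to observe experimentally. The sketch below is an assumption-laden stand-in rather than the construction used in the cited algorithms: instead of an orthogonal projection it uses a matrix of independent Gaussian entries scaled by $1/\sqrt{k}$, a common substitute with the same qualitative behaviour, and all names and default sizes are hypothetical.

```python
import math
import random


def random_projection(D, k, rng):
    """A k x D matrix of independent N(0, 1/k) entries (Gaussian stand-in
    for a rescaled random projection onto a k-dimensional subspace)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(D)] for _ in range(k)]


def project(matrix, x):
    """Apply the linear map given by `matrix` to the vector x."""
    return [sum(m * t for m, t in zip(row, x)) for row in matrix]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(a, b)))


def distortion_ratios(D=500, k=200, pairs=10, seed=0):
    """Ratios (distance after projection) / (distance before) for random
    pairs of Gaussian points in dimension D."""
    rng = random.Random(seed)
    matrix = random_projection(D, k, rng)
    ratios = []
    for _ in range(pairs):
        x = [rng.gauss(0.0, 1.0) for _ in range(D)]
        y = [rng.gauss(0.0, 1.0) for _ in range(D)]
        before = distance(x, y)
        after = distance(project(matrix, x), project(matrix, y))
        ratios.append(after / before)
    return ratios
```

The ratios returned by `distortion_ratios()` cluster around 1 without being bounded by 1, which is the sense in which such a map is only approximately, and only with high probability, 1-Lipschitz.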

Chapter 4 in [43] discusses algorithms for approximate similarity search
based on a traditional metric tree, equipped with 1-Lipschitz decision functions,
but employing aggressive pruning, either randomized or deterministic.
Even here, our proof does not seem to be readily transferable. Indeed, it is
based on the basic premise that every bin meeting the $\varepsilon$-neighbourhood of the
query point needs to be examined in a deterministic fashion. A randomized
algorithm, on the contrary, avoids opening bins which are deemed unlikely
to contain relevant datapoints. Experiments confirm that some of the algorithms
in question perform up to 300 times faster than the corresponding
algorithms for exact search using the same indexing structure (loc.cit.),
and provide circumstantial evidence that the situation here is indeed fundamentally
different and possibly not amenable to the same methods of
analysis.

Where does this leave us? In the opinion of the referee, 

"...the present lower bound proof might be too specific to the model 
(as opposed to the actual problem)." 

The first part of the statement, at least for the time being, appears to 
be well-founded. It is however less clear whether the curse of dimensionality
conjecture for the Hamming cube is the only theoretical lens possible
through which to regard "the actual problem" of the curse of dimensionality.
The model studied in the present article covers a rather wide class
of popular indexing schemes and obtains superpolynomial lower bounds on 
their performance within mathematically exacting standards of statistical 
learning. While the setting of artificially high-dimensional synthetic i.i.d. 
data fed to the scheme is not realistic either, our results still provide
better insight into how these particular schemes function.

Indeed, most data practitioners seem to believe that the intrinsic dimension
of real-life datasets does not exceed perhaps seven or ten
dimensions. A deeper understanding of the underlying geometry of workloads
and its interplay with complexity is called for in order to learn to detect and
use this low dimensionality efficiently, and asymptotic analysis of algorithm
performance in an artificial setting of very high dimensions contributes
towards this goal.